• Linear Regression: Standard linear regression models a relationship between a dependent variable (y) and an independent variable (x) as a straight line:

y = β₀ + β₁x

Where:

β₀ is the intercept.

β₁ is the slope.

  • Introducing the Quadratic Term: Quadratic regression extends linear regression by adding a squared term of the independent variable (x²):

y = β₀ + β₁x + β₂x²

Where:

β₂ is the coefficient of the squared term.

The Curve:

The x² term introduces a curve into the relationship.

If β₂ is positive, the curve opens upward (like a U).

If β₂ is negative, the curve opens downward (like an inverted U).
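The effect of the squared term can be seen on simulated data. The sketch below is illustrative only: the variable names and the true coefficients (2, 0.5, 1.5) are assumptions chosen for the example and have nothing to do with the MMDA data.

```r
# Simulate a curved relationship with a positive coefficient on x^2 (assumed values).
set.seed(1)
x <- seq(-3, 3, length.out = 50)
y <- 2 + 0.5 * x + 1.5 * x^2 + rnorm(50, sd = 0.5)

# Straight-line fit versus quadratic fit.
fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

# The estimated coefficient on I(x^2) is positive, so the fitted curve opens upward.
print(coef(fit_quad))

# The quadratic model captures the curvature that the straight line misses.
print(c(linear = summary(fit_lin)$r.squared,
        quadratic = summary(fit_quad)$r.squared))
```

Wrapping the square in `I()` keeps `^` as arithmetic inside the formula, which is the same device used in the `geom_smooth()` call later in this report.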

1 Sheet 1

1.1 What is the relationship between population and IGF revenue performance patterns?

# Descriptive statistics
Cleaned_Accra_MMDAs_Data %>% skim(Population)
Data summary
Name Piped data
Number of rows 134
Number of columns 76
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Population 1 0.99 169676.7 85308.29 53004 94831 149248 223619 425518 ▇▇▅▂▁
Cleaned_Accra_MMDAs_Data %>% skim(IGF)
Data summary
Name Piped data
Number of rows 134
Number of columns 76
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
IGF 5 0.96 3991084 3516693 23236 1394723 2977112 4969326 16317055 ▇▅▁▁▁
# Histograms
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population)) +
  geom_histogram(bins = 10, fill = "dodgerblue", color = "black") +
  labs(title = "Distribution of Population", x = "Population", y = "Frequency") +
  scale_x_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = IGF)) +
  geom_histogram(bins = 10, fill = "dodgerblue", color = "black") +
  labs(title = "Distribution of IGF Revenue", x = "IGF Revenue", y = "Frequency") +
  scale_x_continuous(labels = comma)

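# Growth Rate (Percentage)
# Note: diff() compares each row with the row directly above it, so this
# assumes the rows form a single series ordered by Year.
Cleaned_Accra_MMDAs_Data <- Cleaned_Accra_MMDAs_Data %>%
  mutate(
    Population_Growth_Rate = c(NA, diff(Population) / Population[-length(Population)] * 100),
    IGF_Growth_Rate = c(NA, diff(IGF) / IGF[-length(IGF)] * 100)
  )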
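If the data stack several assemblies on top of each other, a row-by-row difference would compare the first year of one assembly with the last year of the previous one. A minimal sketch of a per-assembly growth rate is shown below, on toy data. The identifier column `MMDA` and the helper `pct_change()` are hypothetical names introduced for illustration; the real dataset may name its identifier differently or may not need this step at all.

```r
# Toy panel: two hypothetical assemblies observed over three years.
toy <- data.frame(
  MMDA = rep(c("A", "B"), each = 3),   # hypothetical assembly identifier
  Year = rep(2012:2014, times = 2),
  Population = c(100, 110, 121, 200, 210, 231)
)

# Percentage change from the previous observation of the same series.
pct_change <- function(x) c(NA, diff(x) / head(x, -1) * 100)

# Order by assembly and year, then compute the change within each assembly.
toy <- toy[order(toy$MMDA, toy$Year), ]
toy$Population_Growth_Rate <- ave(toy$Population, toy$MMDA, FUN = pct_change)

print(toy)
```

The first year of each assembly gets `NA` instead of a spurious jump between assemblies. Base R `ave()` is used here only to keep the sketch dependency-free; a grouped `mutate()` would do the same job.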

# Plot of Trends
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Population)) + 
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  labs(
    title = "Trends in Population Growth ",
    x = "Year (2012-2022)",
    y = "Population"
  ) +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = IGF)) + 
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  labs(
    title = "Trends in IGF Revenue (Ghana Cedis) Growth ",
    x = "Year (2012-2022)",
    y = "IGF Revenue (Ghana Cedis)"
  ) +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population, y = IGF)) +
  geom_point(color = "blue") +
  labs( title = "Population vs. IGF Revenue",
        x = "Population", y = "IGF Revenue (Ghana Cedis)") +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)

The histograms show that the distributions of both population and IGF revenue are right-skewed. The scatter plot shows a positive relationship between population and IGF revenue: as population increases, IGF revenue tends to increase.

1.1.1 Regression Analysis

mod1 <- lm(IGF ~ Population, data = Cleaned_Accra_MMDAs_Data)
summary(mod1)
## 
## Call:
## lm(formula = IGF ~ Population, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6043138 -1775692  -977216   819612 11256715 
## 
## Coefficients:
##                Estimate  Std. Error t value   Pr(>|t|)    
## (Intercept) 1100581.153  652684.396   1.686     0.0942 .  
## Population       17.647       3.592   4.912 0.00000274 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3245000 on 126 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.1607, Adjusted R-squared:  0.1541 
## F-statistic: 24.13 on 1 and 126 DF,  p-value: 0.000002739
Cleaned_Accra_MMDAs_Data %>%
  ggplot(aes(x = Population, y = IGF)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) + 
  labs(x = "Population", y = "IGF Revenue (Ghana Cedis)", title = "Linear Relationship between Population and IGF Revenue") + 
  scale_y_continuous(labels = scales::comma)

# The Quadratic Term
Cleaned_Accra_MMDAs_Data$Population_Squared <- Cleaned_Accra_MMDAs_Data$Population^2

#  Quadratic Regression
mod_quad <- lm(IGF ~ Population + Population_Squared, data = Cleaned_Accra_MMDAs_Data)

summary(mod_quad)
## 
## Call:
## lm(formula = IGF ~ Population + Population_Squared, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6946145 -2128240  -639971  1104051 10445638 
## 
## Coefficients:
##                            Estimate       Std. Error t value Pr(>|t|)    
## (Intercept)        4509175.15965575 1282425.99841933   3.516 0.000611 ***
## Population             -24.91868849      14.36168841  -1.735 0.085191 .  
## Population_Squared       0.00010718       0.00003509   3.055 0.002753 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3143000 on 125 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.219,  Adjusted R-squared:  0.2065 
## F-statistic: 17.53 on 2 and 125 DF,  p-value: 0.0000001948
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population, y = IGF)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = TRUE) + # Use formula for quadratic
  labs(x = "Population", y = "IGF Revenue (Ghana Cedis)", title = "Quadratic Relationship between Population and IGF Revenue") +
  scale_y_continuous(labels = comma)

Linear Regression:

Coefficients:

Intercept: 1100581.153

Population: 17.647

P-values: Intercept: 0.0942 (not significant at the 5% level)

Population: 0.00000274 (significant)

R-squared: Multiple R-squared: 0.1607

Adjusted R-squared: 0.1541

Interpretation:

There is a statistically significant positive relationship between Population and IGF (p = 0.00000274). For each additional person, IGF is predicted to increase by approximately 17.65 Ghana Cedis.

However, the Multiple R-squared of 0.1607 indicates that Population explains only 16.07% of the variance in IGF, and the Adjusted R-squared of 0.1541 is similarly low.

Quadratic Regression:

Coefficients: Intercept: 4509175.15965575

Population: -24.91868849

Population_Squared: 0.00010718

P-values: The Population coefficient is not significant at the 5% level (p = 0.0852); the intercept (p = 0.000611) and the Population_Squared coefficient (p = 0.002753) are significant. The overall model is statistically significant (p-value = 0.0000001948).

R-squared: Multiple R-squared: 0.219

Adjusted R-squared: 0.2065

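Interpretation: The quadratic model shows a statistically significant relationship between population and IGF revenue. The positive Population_Squared coefficient means the fitted curve opens upward. R-squared improves from 0.1607 to 0.219 (Adjusted R-squared from 0.1541 to 0.2065), but the model still explains only about a fifth of the variance in IGF.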
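For a quadratic fit y = β₀ + β₁x + β₂x², the curve turns at x* = -β₁ / (2β₂). The sketch below plugs in the estimates printed in the quadratic summary above. It is only arithmetic on the reported coefficients, and since the Population coefficient is not significant at the 5% level, the location of the turning point should be read as imprecise.

```r
# Estimates copied from summary(mod_quad) above.
b1 <- -24.91868849   # Population
b2 <- 0.00010718     # Population_Squared

# Vertex of the fitted parabola; a minimum here because b2 > 0.
turning_point <- -b1 / (2 * b2)
print(round(turning_point))
```

The result is a population of roughly 116,000, which lies inside the observed range (53,004 to 425,518). Under this model, fitted IGF falls slightly up to that point and rises beyond it.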

  • Checking Regression Assumptions
# Residual
ggplot(data = data.frame(residuals = residuals(mod1), fitted = fitted(mod1)), aes(x = fitted, y = residuals)) +
  geom_point() + # Added geom_point()
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs. Fitted (Linear) ", x = "Fitted Values", y = "Residuals")

ggplot(data = data.frame(residuals = residuals(mod1)), aes(x = residuals)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Residuals(Linear)", x = "Residuals")

ggplot(data = data.frame(residuals = residuals(mod1)), aes(sample = residuals)) +
  geom_point(stat = "qq") +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Residuals")

#  Residuals vs. Fitted Values
ggplot(data = data.frame(residuals = residuals(mod_quad), fitted = fitted(mod_quad)), 
       aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs. Fitted (Quadratic Model)", x = "Fitted Values", y = "Residuals")

#  Histogram of Residuals
ggplot(data = data.frame(residuals = residuals(mod_quad)), aes(x = residuals)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Residuals (Quadratic Model)", x = "Residuals")

#  Q-Q Plot of Residuals
ggplot(data = data.frame(residuals = residuals(mod_quad)), aes(sample = residuals)) +
  geom_point(stat = "qq") +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Residuals (Quadratic Model)")

shapiro.test(resid(mod1))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(mod1)
## W = 0.83278, p-value = 0.0000000000952
shapiro.test(resid(mod_quad))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(mod_quad)
## W = 0.87419, p-value = 0.000000005049
#  Durbin-Watson Test (Autocorrelation)
dwtest(mod1)
## 
##  Durbin-Watson test
## 
## data:  mod1
## DW = 0.62936, p-value = 0.000000000000001749
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(mod_quad)
## 
##  Durbin-Watson test
## 
## data:  mod_quad
## DW = 0.64655, p-value = 0.000000000000002704
## alternative hypothesis: true autocorrelation is greater than 0
#  Breusch-Pagan Test (Homoscedasticity)
bptest(mod1)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 0.61495, df = 1, p-value = 0.4329
bptest(mod_quad)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_quad
## BP = 2.7143, df = 2, p-value = 0.2574
#  Variance Inflation Factor (VIF) - Multicollinearity
# VIF needs at least two predictors, so it is computed for the quadratic model only.
vif(mod_quad)
##         Population Population_Squared 
##           17.03933           17.03933

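Both the linear and quadratic models violate the normality assumption (Shapiro-Wilk p-values far below 0.05) and the independence assumption (Durbin-Watson statistics of 0.63 and 0.65, well below 2, indicating positive autocorrelation). The Breusch-Pagan tests do not reject homoscedasticity for either model (p = 0.4329 and p = 0.2574). In the quadratic model the VIF of 17.04 exceeds the common rule-of-thumb threshold of 10, which is expected because Population and Population_Squared are strongly correlated by construction.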
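A high VIF between a variable and its own square is largely structural and can often be reduced by centering the predictor before squaring it. The sketch below illustrates this on simulated values, not on the MMDA data. The helper `vif_two()` is a hypothetical name; it uses the identity VIF = 1 / (1 - r²), which holds when a model has exactly two predictors.

```r
# Simulated positive predictor on a scale similar to Population (assumed range).
set.seed(42)
x <- runif(200, min = 50000, max = 430000)

# With exactly two predictors, each VIF equals 1 / (1 - r^2),
# where r is the correlation between the two predictors.
vif_two <- function(a, b) 1 / (1 - cor(a, b)^2)

# Raw term and its square are almost collinear.
vif_raw <- vif_two(x, x^2)

# Centering before squaring removes most of that correlation.
xc <- x - mean(x)
vif_centered <- vif_two(xc, xc^2)

print(c(raw = vif_raw, centered = vif_centered))
```

Centering changes the meaning of the linear coefficient (it becomes the slope at the mean) but leaves the fitted values and the squared-term estimate unchanged. With a right-skewed predictor such as Population, the reduction would likely be smaller than in this symmetric example.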

  • Transformations
# Transformed Model
log_log_mod <- lm(log(IGF) ~ log(Population), data = Cleaned_Accra_MMDAs_Data)
summary(log_log_mod)
## 
## Call:
## lm(formula = log(IGF) ~ log(Population), data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9681 -0.4194  0.0490  0.4937  2.2734 
## 
## Coefficients:
##                 Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)       3.8304     2.1396   1.790     0.0758 .  
## log(Population)   0.9197     0.1799   5.114 0.00000114 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.989 on 126 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.1719, Adjusted R-squared:  0.1653 
## F-statistic: 26.15 on 1 and 126 DF,  p-value: 0.000001144
# Scatter Plots (Transformed Data)
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Ln_Pop, y = Ln_IGF)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Log(Population) vs. Log(IGF Revenue)", x = "Log(Population)", y = "Log(IGF Revenue)")

sqrt_model <- lm(sqrt(IGF) ~ Population, data = Cleaned_Accra_MMDAs_Data)
summary(sqrt_model)
## 
## Call:
## lm(formula = sqrt(IGF) ~ Population, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1643.1  -475.0  -104.7   342.6  2275.8 
## 
## Coefficients:
##                 Estimate   Std. Error t value        Pr(>|t|)    
## (Intercept) 1088.5423351  151.8126481   7.170 0.0000000000561 ***
## Population     0.0044499    0.0008356   5.326 0.0000004469786 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 754.8 on 126 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.1837, Adjusted R-squared:  0.1773 
## F-statistic: 28.36 on 1 and 126 DF,  p-value: 0.000000447

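Both transformed models keep a statistically significant relationship. In the log-log model (p-value: 0.000001144, R-squared: 0.1719) the slope of 0.9197 is an elasticity: a 1% increase in population is associated with roughly a 0.92% increase in IGF. The square root model gives p-value: 0.000000447 and R-squared: 0.1837. Both R-squared values are slightly above the 0.1607 of the simple linear model, but because each is computed on a different response scale (IGF, log(IGF), sqrt(IGF)), they are not directly comparable and do not by themselves identify a best-fitting model.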
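The log-log slope can be translated into percentage terms for larger changes in population. The sketch below is plain arithmetic on the coefficient 0.9197 reported in the log-log summary above; it assumes the log-log form holds over the range considered.

```r
# Slope of log(IGF) on log(Population), copied from summary(log_log_mod).
b <- 0.9197

# Predicted percentage change in IGF for a 10% rise in population.
pct_for_10 <- (1.10^b - 1) * 100

# Predicted multiplicative change in IGF when population doubles.
factor_doubling <- 2^b

print(round(c(pct_for_10 = pct_for_10, factor_doubling = factor_doubling), 2))
```

A 10% larger population corresponds to roughly 9.2% more IGF, and a doubling of population to a little under a doubling of IGF (a factor of about 1.89). An elasticity below 1 indicates that revenue grows slightly less than proportionally with population.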

# Function to perform diagnostic tests and plots
# Requires ggplot2, gridExtra (grid.arrange), lmtest (dwtest, bptest) and car (vif)
perform_diagnostics <- function(model, model_name) {
  # Residuals vs. Fitted
  plot1 <- ggplot(data = data.frame(residuals = residuals(model), fitted = fitted(model)),
                 aes(x = fitted, y = residuals)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    labs(title = paste("Residuals vs. Fitted (", model_name, ")"), x = "Fitted Values", y = "Residuals")

  # Histogram of Residuals
  plot2 <- ggplot(data = data.frame(residuals = residuals(model)), aes(x = residuals)) +
    geom_histogram(bins = 10, fill = "skyblue", color = "black") +
    labs(title = paste("Histogram of Residuals (", model_name, ")"), x = "Residuals")

  # Q-Q Plot of Residuals
  plot3 <- ggplot(data = data.frame(residuals = residuals(model)), aes(sample = residuals)) +
    geom_point(stat = "qq") +
    stat_qq_line() +
    labs(title = paste("Q-Q Plot of Residuals (", model_name, ")"))

  # Durbin-Watson Test
  dw_test <- dwtest(model)
  print(paste("Durbin-Watson Test (", model_name, "):"))
  print(dw_test)

  # Breusch-Pagan Test
  bp_test <- bptest(model)
  print(paste("Breusch-Pagan Test (", model_name, "):"))
  print(bp_test)

  # Print VIF (if applicable)
  if (length(coef(model)) > 2) { # Check for multiple predictors
    vif_result <- vif(model)
    print(paste("VIF (", model_name, "):"))
    print(vif_result)
  }

  # Arrange plots
  grid.arrange(plot1, plot2, plot3, nrow = 1)
}

# Perform diagnostics for each model
perform_diagnostics(mod1, "Linear Model")
## [1] "Durbin-Watson Test ( Linear Model ):"
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 0.62936, p-value = 0.000000000000001749
## alternative hypothesis: true autocorrelation is greater than 0
## 
## [1] "Breusch-Pagan Test ( Linear Model ):"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 0.61495, df = 1, p-value = 0.4329

perform_diagnostics(log_log_mod, "Log-Log Model")
## [1] "Durbin-Watson Test ( Log-Log Model ):"
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 0.92318, p-value = 0.0000000002708
## alternative hypothesis: true autocorrelation is greater than 0
## 
## [1] "Breusch-Pagan Test ( Log-Log Model ):"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 10.113, df = 1, p-value = 0.001473

perform_diagnostics(sqrt_model, "Square Root Model")
## [1] "Durbin-Watson Test ( Square Root Model ):"
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 0.70331, p-value = 0.00000000000004657
## alternative hypothesis: true autocorrelation is greater than 0
## 
## [1] "Breusch-Pagan Test ( Square Root Model ):"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 1.1057, df = 1, p-value = 0.293

cor.test(Cleaned_Accra_MMDAs_Data$Population, Cleaned_Accra_MMDAs_Data$IGF)
## 
##  Pearson's product-moment correlation
## 
## data:  Cleaned_Accra_MMDAs_Data$Population and Cleaned_Accra_MMDAs_Data$IGF
## t = 4.9122, df = 126, p-value = 0.000002739
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2443784 0.5370741
## sample estimates:
##       cor 
## 0.4009076

From the analysis so far, there is a statistically significant positive linear relationship between population and IGF revenue, but it is only moderate in strength (Pearson's product-moment correlation coefficient = 0.4009, 95% confidence interval 0.244 to 0.537). Several regression assumptions remain violated even after the transformations: the Durbin-Watson tests still indicate positive autocorrelation in every model, and the log-log model additionally fails the Breusch-Pagan test (p = 0.001473).
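In a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation, so the correlation test and the linear model above should agree. The quick check below uses only the two figures printed in the outputs.

```r
# Pearson correlation from cor.test() and R-squared from summary(mod1).
r <- 0.4009076
r_squared_reported <- 0.1607

# Squaring the correlation should reproduce the reported R-squared.
print(round(r^2, 4))
print(abs(r^2 - r_squared_reported) < 0.0001)
```

The two agree to four decimal places, which is also why the t statistic (4.912) and p-value (0.000002739) are identical in both outputs.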

1.2 What is the relationship between population and DACF revenue performance patterns?

Cleaned_Accra_MMDAs_Data %>% skim(Population)
Data summary
Name Piped data
Number of rows 134
Number of columns 79
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Population 1 0.99 169676.7 85308.29 53004 94831 149248 223619 425518 ▇▇▅▂▁
Cleaned_Accra_MMDAs_Data %>% skim(DACF)
Data summary
Name Piped data
Number of rows 134
Number of columns 79
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DACF 0 1 3167362 5537956 0 1584136 2410939 3468858 64171193 ▇▁▁▁▁
# Histograms
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population)) +
  geom_histogram(bins = 10, fill = "dodgerblue", color = "black") +
  labs(title = "Distribution of Population", x = "Population")

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = DACF)) +
  geom_histogram(bins = 10, fill = "dodgerblue", color = "black") +
  labs(title = "Distribution of DACF Revenue", x = "DACF Revenue")

# Plot of Trends
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Population)) + 
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  labs(
    title = "Trends in Population Growth ",
    x = "Year (2012-2022)",
    y = "Population"
  ) +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = DACF)) + 
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  labs(
    title = "Trends in DACF Revenue (Ghana Cedis) Growth ",
    x = "Year (2012-2022)",
    y = "DACF Revenue (Ghana Cedis)"
  ) +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population, y = DACF)) +
  geom_point(color = "blue") +
  labs( title = "Population vs. DACF Revenue",
        x = "Population", y = "DACF Revenue (Ghana Cedis)") +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)

The histograms show that both population and DACF revenue are right-skewed, and DACF contains at least one potential outlier (a maximum of 64,171,193 against a median of 2,410,939). The scatter plot shows at most a weak relationship between population and DACF revenue, and it does not appear to be linear.
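One common way to make "potential outlier" concrete is Tukey's 1.5 × IQR rule. This rule is not used elsewhere in this report; it is brought in here only as an illustration, applied to the quartiles printed in the DACF summary above.

```r
# Quartiles and maximum copied from the skim(DACF) output.
q1 <- 1584136
q3 <- 3468858
max_dacf <- 64171193

# Tukey's upper fence: values above it are flagged as potential outliers.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

print(upper_fence)
print(max_dacf > upper_fence)
```

The maximum DACF value is about ten times the upper fence, which fits the single extreme bar in the histogram and the very large maximum residual (60,421,667) in the regression output that follows.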

1.2.1 Regression Analysis

mod2 <- lm(DACF ~ Population, data = Cleaned_Accra_MMDAs_Data)
summary(mod2)
## 
## Call:
## lm(formula = DACF ~ Population, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3722082 -1560495  -635551   488389 60421667 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) 4154836.128 1075738.100   3.862 0.000176 ***
## Population       -5.735       5.669  -1.012 0.313501    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5556000 on 131 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.007754,   Adjusted R-squared:  0.0001798 
## F-statistic: 1.024 on 1 and 131 DF,  p-value: 0.3135
Cleaned_Accra_MMDAs_Data %>%
  ggplot(aes(x = Population, y = DACF)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) + # Added confidence intervals
  labs(x = "Population", y = "DACF Revenue (Ghana Cedis)", title = "Linear Relationship between Population and DACF Revenue") +
  scale_y_continuous(labels = scales::comma)

#  Quadratic Regression
# Stored under its own name so the IGF quadratic model (mod_quad) is not overwritten
mod2_quad <- lm(DACF ~ Population + Population_Squared, data = Cleaned_Accra_MMDAs_Data)

summary(mod2_quad)
## 
## Call:
## lm(formula = DACF ~ Population + Population_Squared, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4222408 -1696001  -495436   576125 59843864 
## 
## Coefficients:
##                            Estimate       Std. Error t value Pr(>|t|)   
## (Intercept)        6207131.11715897 2248941.43187925   2.760  0.00662 **
## Population             -30.97832015      24.94618474  -1.242  0.21654   
## Population_Squared       0.00006195       0.00005962   1.039  0.30071   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5554000 on 130 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01593,    Adjusted R-squared:  0.0007872 
## F-statistic: 1.052 on 2 and 130 DF,  p-value: 0.3522

The regression results show no statistically significant linear relationship between population and DACF revenue (p-value: 0.3135, R-squared: 0.007754, Adjusted R-squared: 0.0001798). The Population coefficient is negative (-5.735), suggesting a slight downward trend, but it is not distinguishable from zero. The quadratic model is not significant either (F-test p-value: 0.3522).

  • Checking Regression Assumptions
#  Residual 
ggplot(data = data.frame(residuals = residuals(mod2),
                        fitted = fitted(mod2)),
       aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Values", y = "Residuals")

ggplot(data = data.frame(residuals = residuals(mod2)),
       aes(x = residuals)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Residuals", x = "Residuals")

ggplot(data = data.frame(residuals = residuals(mod2)),
       aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Residuals ")

shapiro.test(resid(mod2))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(mod2)
## W = 0.27024, p-value < 0.00000000000000022
# Autocorrelation
dwtest(mod2)
## 
##  Durbin-Watson test
## 
## data:  mod2
## DW = 2.0154, p-value = 0.511
## alternative hypothesis: true autocorrelation is greater than 0
# Homoscedasticity (Constant Variance of Residuals)

bptest(mod2)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod2
## BP = 1.4442, df = 1, p-value = 0.2295
# Multicollinearity
# Simple linear regression with one predictor (Population), so multicollinearity is not an issue.

# Normality
# Residual normality is assessed with the Shapiro-Wilk test above.

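The assumption checks for the linear model show that the residuals are not normally distributed (Shapiro-Wilk W = 0.27024, p < 0.00000000000000022). The Durbin-Watson (p = 0.511) and Breusch-Pagan (p = 0.2295) tests give no evidence of autocorrelation or heteroscedasticity.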

  • Transformation.
#Transformed Models
# DACF contains zeros (minimum 0), so 1 is added before taking the log
Cleaned_Accra_MMDAs_Data$DACF_adjusted <- Cleaned_Accra_MMDAs_Data$DACF + 1
log_mod2 <- lm(log(DACF_adjusted) ~ log(Population), data = Cleaned_Accra_MMDAs_Data)
summary(log_mod2)
# 
# Call:
# lm(formula = log(DACF_adjusted) ~ log(Population), data = Cleaned_Accra_MMDAs_Data)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -14.3214  -0.2674   0.1536   0.4779   3.6785 
# 
# Coefficients:
#                 Estimate Std. Error t value Pr(>|t|)    
# (Intercept)      10.4161     2.9779   3.498 0.000641 ***
# log(Population)   0.3477     0.2497   1.393 0.166050    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 1.447 on 131 degrees of freedom
#   (1 observation deleted due to missingness)
# Multiple R-squared:  0.01459, Adjusted R-squared:  0.007069 
# F-statistic:  1.94 on 1 and 131 DF,  p-value: 0.166
sqrt_mod2 <- lm( sqrt(DACF)~sqrt(Population), data = Cleaned_Accra_MMDAs_Data )  
summary(sqrt_mod2)
# 
# Call:
# lm(formula = sqrt(DACF) ~ sqrt(Population), data = Cleaned_Accra_MMDAs_Data)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -1653.2  -359.2   -73.4   234.7  6355.4 
# 
# Coefficients:
#                   Estimate Std. Error t value      Pr(>|t|)    
# (Intercept)      1716.7906   263.9496   6.504 0.00000000151 ***
# sqrt(Population)   -0.2314     0.6408  -0.361         0.719    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 742.3 on 131 degrees of freedom
#   (1 observation deleted due to missingness)
# Multiple R-squared:  0.0009948,   Adjusted R-squared:  -0.006631 
# F-statistic: 0.1304 on 1 and 131 DF,  p-value: 0.7185
#  Scatter Plots (Transformed Data)
# log(DACF_adjusted) matches the response used in log_mod2 and avoids log(0)
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = log(Population), y = log(DACF_adjusted))) +
  geom_point() +
  geom_smooth(method = "lm")+
  labs(title = "Log(Population) vs. Log(DACF Revenue)",
       x = "Log(Population)", y = "Log(DACF Revenue)")

# Square root scale, matching sqrt_mod2
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = sqrt(Population), y = sqrt(DACF))) +
  geom_point() +
  geom_smooth(method = "lm")+
  labs(title = "Sqrt(Population) vs. Sqrt(DACF Revenue)",
       x = "Sqrt(Population)", y = "Sqrt(DACF Revenue)")


# Perform diagnostics for each model
perform_diagnostics(mod2, "Linear Model")
## [1] "Durbin-Watson Test ( Linear Model ):"
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 2.0154, p-value = 0.511
## alternative hypothesis: true autocorrelation is greater than 0
## 
## [1] "Breusch-Pagan Test ( Linear Model ):"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 1.4442, df = 1, p-value = 0.2295

perform_diagnostics(log_mod2, "Log-Log Model")
## [1] "Durbin-Watson Test ( Log-Log Model ):"
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.9062, p-value = 0.2699
## alternative hypothesis: true autocorrelation is greater than 0
## 
## [1] "Breusch-Pagan Test ( Log-Log Model ):"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 2.4445, df = 1, p-value = 0.1179

perform_diagnostics(sqrt_mod2, "Square Root Model")
## [1] "Durbin-Watson Test ( Square Root Model ):"
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.7832, p-value = 0.09263
## alternative hypothesis: true autocorrelation is greater than 0
## 
## [1] "Breusch-Pagan Test ( Square Root Model ):"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 2.5715, df = 1, p-value = 0.1088

shapiro.test(resid(mod2))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(mod2)
## W = 0.27024, p-value < 0.00000000000000022
shapiro.test(resid(log_mod2))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(log_mod2)
## W = 0.46366, p-value < 0.00000000000000022
shapiro.test(resid(sqrt_mod2))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(sqrt_mod2)
## W = 0.67538, p-value = 0.0000000000000009358

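Neither the log-log model (p = 0.166) nor the square root model (p = 0.7185) is statistically significant. The transformations move the Shapiro-Wilk statistic closer to 1 (W = 0.27 for the linear model, 0.46 for log-log, 0.68 for square root), but normality is still rejected in every case.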

Across the regression analysis, none of the models is statistically significant, and the normality assumption is not met in any of them. These models therefore give no evidence that changes in population reliably predict changes in DACF revenue; any observed pattern could plausibly be due to chance.

1.3 What is the relationship between population, recurrent and capital expenditure?

  • Descriptive Statistics

Recurrent expenditure is not available (NA) in the data.

Cleaned_Accra_MMDAs_Data %>% skim(Capital_Expenditure)
Data summary
Name Piped data
Number of rows 134
Number of columns 80
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Capital_Expenditure 6 0.96 3003494 2232785 0 1485120 2420991 4347636 14576636 ▇▅▁▁▁
# Capital Expenditure Histogram
cap_hist <- ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Capital_Expenditure)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10, fill = "skyblue", color = "black") +
  geom_density(color = "red") +
  labs(title = "Distribution of Capital Expenditure", x = "Capital Expenditure (Ghana Cedis)", y = "Density") +
  scale_x_continuous(labels = comma) 



# Population Histogram
pop_hist <- ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10, fill = "dodgerblue", color = "black") +
  geom_density(color = "red") +
  labs(title = "Distribution of Population", x = "Population", y = "Density") +
  scale_x_continuous(labels = comma) 

cap_hist

pop_hist

  • Trends
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Population)) + 
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  labs(
    title = "Population Trend",
    x = "Year (2012-2022)",
    y = "Population"
  ) +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Capital_Expenditure)) + 
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  labs(
    title = "Capital Expenditure Trend",
    x = "Year (2012-2022)",
    y = "Capital Expenditure"
  ) +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population, y = Capital_Expenditure)) +
  geom_point(color = "blue") +
  labs( title = "Population vs. Capital Expenditure",
        x = "Population", y = "Capital Expenditure (Ghana Cedis)") +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma) 

# Calculate Per Capita Values
Cleaned_Accra_MMDAs_Data$Capital_Exp_Per_Capita <- Cleaned_Accra_MMDAs_Data$Capital_Expenditure / Cleaned_Accra_MMDAs_Data$Population



# Per Capita Analysis 
# na.rm = TRUE is needed because some per capita values are missing
average_capita <- mean(Cleaned_Accra_MMDAs_Data$Capital_Exp_Per_Capita, na.rm = TRUE)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year)) +
  geom_point(aes(y = Capital_Exp_Per_Capita), color = "blue") +
  labs(title = "Capital Expenditure Per Capita Over Time", x = "Year (2012 - 2022)", y = "Ghana Cedis Per Capita") +
  scale_y_continuous(labels = comma) 

1.3.1 Regression Results

mod3 <- lm(Capital_Expenditure ~ Population, data = Cleaned_Accra_MMDAs_Data)
summary(mod3)
## 
## Call:
## lm(formula = Capital_Expenditure ~ Population, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3048586 -1638009  -649028  1116008 11122904 
## 
## Coefficients:
##                Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept) 2039122.193  444409.755   4.588 0.0000107 ***
## Population        5.602       2.300   2.436    0.0163 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2197000 on 125 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.04532,    Adjusted R-squared:  0.03768 
## F-statistic: 5.934 on 1 and 125 DF,  p-value: 0.01626
# mod_cap has the same specification as mod3, so reuse that fit instead of refitting
mod_cap <- mod3
Cleaned_Accra_MMDAs_Data %>% 
  ggplot(aes(x = Population, y = Capital_Expenditure)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Population", y = "Capital Expenditure",
       title = "Linear Relationship between Population and Capital Expenditure") +
  scale_y_continuous(labels = scales::comma)

The linear regression shows a statistically significant positive relationship between population and capital expenditure (slope = 5.602, p = 0.01626). However, the fit is weak: population explains only 4.53% of the variation in capital expenditure (R-squared = 0.04532).
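
As a worked reading of the coefficients reported by summary(mod3), the sketch below plugs the estimates into the fitted line. The median population (149,248) is taken from the skim output earlier in the report; the variable names are illustrative only and are not part of the original analysis.

```r
# Estimates copied from the summary(mod3) output
intercept <- 2039122.193   # Ghana Cedis
slope     <- 5.602         # Ghana Cedis per additional person
r_squared <- 0.04532

# Predicted capital expenditure at the median population
median_population <- 149248
predicted_capex <- intercept + slope * median_population

# Expected change in capital expenditure for 10,000 additional residents
change_per_10k <- slope * 10000

# Share of the variance in capital expenditure explained by population
pct_explained <- r_squared * 100

print(round(predicted_capex))
print(change_per_10k)
print(pct_explained)
```

On these estimates, an assembly with the median population is predicted to spend roughly 2.88 million Ghana Cedis, and each additional 10,000 residents is associated with about 56,020 Ghana Cedis more capital expenditure, while more than 95% of the variation remains unexplained.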

  • Checking Regression Assumptions
# Diagnostic Function
perform_diagnostics <- function(model, model_name) {
  # Residuals vs. Fitted
  plot1 <- ggplot(data = data.frame(residuals = residuals(model), fitted = fitted(model)),
                 aes(x = fitted, y = residuals)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    labs(title = paste("Residuals vs. Fitted (", model_name, ")"), x = "Fitted Values", y = "Residuals")

  # Histogram of Residuals
  plot2 <- ggplot(data = data.frame(residuals = residuals(model)), aes(x = residuals)) +
    geom_histogram(bins = 10, fill = "skyblue", color = "black") +
    labs(title = paste("Histogram of Residuals (", model_name, ")"), x = "Residuals")

  # Q-Q Plot of Residuals
  plot3 <- ggplot(data = data.frame(residuals = residuals(model)), aes(sample = residuals)) +
    geom_point(stat = "qq") +
    stat_qq_line() +
    labs(title = paste("Q-Q Plot of Residuals (", model_name, ")"))

  # Durbin-Watson Test
  dw_test <- dwtest(model)
  print(paste("Durbin-Watson Test (", model_name, "):"))
  print(dw_test)

  # Breusch-Pagan Test
  bp_test <- bptest(model)
  print(paste("Breusch-Pagan Test (", model_name, "):"))
  print(bp_test)

  # Print VIF (if applicable)
  if (length(coef(model)) > 2) { # Check for multiple predictors
    vif_result <- vif(model)
    print(paste("VIF (", model_name, "):"))
    print(vif_result)
  }

  # Arrange plots
  grid.arrange(plot1, plot2, plot3, nrow = 1)
}

#  Perform Diagnostics
# Capital Expenditure
perform_diagnostics(mod3, "Capital Expenditure Model")
## [1] "Durbin-Watson Test ( Capital Expenditure Model ):"
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.0642, p-value = 0.00000003825
## alternative hypothesis: true autocorrelation is greater than 0
## 
## [1] "Breusch-Pagan Test ( Capital Expenditure Model ):"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 0.057858, df = 1, p-value = 0.8099

shapiro.test(resid(mod3))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(mod3)
## W = 0.89149, p-value = 0.00000003736
# Recurrent Expenditure

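The linear model violates two assumptions of linear regression: the Durbin-Watson test indicates positive autocorrelation in the residuals (DW = 1.0642, p < 0.001), and the Shapiro-Wilk test rejects normality of the residuals (W = 0.89149, p < 0.001). The Breusch-Pagan test gives no evidence of heteroscedasticity (BP = 0.057858, p = 0.8099).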
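
To show what the Durbin-Watson statistic measures, the following minimal sketch computes it by hand in base R. It uses simulated data with deliberately autocorrelated errors, not the MMDA dataset, and the object names are assumptions made for illustration.

```r
set.seed(1)  # simulated data, not the MMDA dataset

n <- 200
x <- seq_len(n)

# Errors with positive first-order autocorrelation (rho = 0.8)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 1 + 0.5 * x + e

fit <- lm(y ~ x)
res <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares. Values near 2 suggest
# no first-order autocorrelation; values well below 2 suggest positive
# autocorrelation.
dw_stat <- sum(diff(res)^2) / sum(res^2)
print(dw_stat)
```

The statistic depends on the order of the rows, so for stacked data covering several assemblies and years it is worth confirming that the rows are sorted in a meaningful sequence before interpreting the test.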

  • Quadratic model
Cleaned_Accra_MMDAs_Data$Recrrent_Expenditure_squared <- Cleaned_Accra_MMDAs_Data$Recrrent_Expenditure^2

Cleaned_Accra_MMDAs_Data$Capital_Expenditure_squared <- Cleaned_Accra_MMDAs_Data$Capital_Expenditure^2

mod_quad <- lm(cbind(Capital_Expenditure) ~ Population + Population_Squared, data = Cleaned_Accra_MMDAs_Data)

# View the summary
summary(mod_quad)
## 
## Call:
## lm(formula = cbind(Capital_Expenditure) ~ Population + Population_Squared, 
##     data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2976473 -1658166  -611487  1134668 11190838 
## 
## Coefficients:
##                            Estimate       Std. Error t value Pr(>|t|)  
## (Intercept)        2421761.56908601  944502.61295085   2.564   0.0115 *
## Population               0.99419964      10.28986386   0.097   0.9232  
## Population_Squared       0.00001118       0.00002434   0.460   0.6467  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2204000 on 124 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.04694,    Adjusted R-squared:  0.03157 
## F-statistic: 3.054 on 2 and 124 DF,  p-value: 0.05074
# Scatter plot with the fitted quadratic curve
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population, y = Capital_Expenditure)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = TRUE) +
  labs(x = "Population", y = "Capital Expenditure (Ghana Cedis)", title = "Quadratic Relationship between Population and Capital Expenditure") +
  scale_y_continuous(labels = comma)

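The quadratic model does not improve on the linear model. The overall F-test narrowly misses the 0.05 significance level (p = 0.05074), neither the Population term (p = 0.9232) nor the Population_Squared term (p = 0.6467) is individually significant, and the adjusted R-squared falls from 0.03768 to 0.03157.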
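
One way to make the linear-versus-quadratic comparison explicit is an F-test of the two nested models. The sketch below illustrates the idea on simulated data (not the MMDA dataset); the data frame `toy` and its columns are hypothetical.

```r
set.seed(42)  # simulated data, not the MMDA dataset

population <- runif(120, min = 50000, max = 400000)
capex <- 2e6 + 5 * population + rnorm(120, sd = 2e6)
toy <- data.frame(population, capex)

linear_fit    <- lm(capex ~ population, data = toy)
quadratic_fit <- lm(capex ~ population + I(population^2), data = toy)

# F-test of the nested models: a large p-value means the squared term
# does not significantly reduce the residual sum of squares.
comparison <- anova(linear_fit, quadratic_fit)
print(comparison)

# Adjusted R-squared penalises the extra term
print(summary(linear_fit)$adj.r.squared)
print(summary(quadratic_fit)$adj.r.squared)
```

With a single added term, the p-value of this F-test matches the p-value of the t-test on the squared coefficient, so it leads to the same conclusion as the coefficient table while stating the model comparison directly.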

  • Transformations
# Log transformation for Capital Expenditure (1 is added first because the minimum value is 0)

Cleaned_Accra_MMDAs_Data$Capital_Expenditure_adjusted <- Cleaned_Accra_MMDAs_Data$Capital_Expenditure + 1
log_cap_mod <- lm(log(Capital_Expenditure_adjusted) ~ Population, data = Cleaned_Accra_MMDAs_Data)
summary(log_cap_mod)
## 
## Call:
## lm(formula = log(Capital_Expenditure_adjusted) ~ Population, 
##     data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3169  -0.3649   0.4767   1.1044   2.3706 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept) 12.654142532  0.532370980  23.769 < 0.0000000000000002 ***
## Population   0.000008784  0.000002755   3.188              0.00181 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.632 on 125 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.07521,    Adjusted R-squared:  0.06781 
## F-statistic: 10.17 on 1 and 125 DF,  p-value: 0.001808
perform_diagnostics(log_cap_mod, "Log capital Expenditure Model")
## [1] "Durbin-Watson Test ( Log capital Expenditure Model ):"
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 0.96763, p-value = 0.000000001577
## alternative hypothesis: true autocorrelation is greater than 0
## 
## [1] "Breusch-Pagan Test ( Log capital Expenditure Model ):"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 7.0361, df = 1, p-value = 0.007988

# Log-transformed variables; the adjusted column (Capital_Expenditure + 1) avoids log(0) = -Inf
Cleaned_Accra_MMDAs_Data$Ln_Population <- log(Cleaned_Accra_MMDAs_Data$Population)
Cleaned_Accra_MMDAs_Data$Ln_Capital_Expenditure <- log(Cleaned_Accra_MMDAs_Data$Capital_Expenditure_adjusted)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Ln_Population, y = Ln_Capital_Expenditure)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Log(Population) vs. Log(Capital Expenditure + 1)",
       x = "Log(Population)", y = "Log(Capital Expenditure + 1)")

#  Square root transformation for Capital Expenditure
sqrt_cap_mod <- lm(sqrt(Capital_Expenditure) ~ Population, data = Cleaned_Accra_MMDAs_Data)
summary(sqrt_cap_mod)
## 
## Call:
## lm(formula = sqrt(Capital_Expenditure) ~ Population, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1391.79  -444.70   -45.67   440.98  2052.16 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept) 1232.4143446  132.5196479    9.30 0.000000000000000587 ***
## Population     0.0021123    0.0006858    3.08              0.00255 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 655.1 on 125 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.07054,    Adjusted R-squared:  0.06311 
## F-statistic: 9.487 on 1 and 125 DF,  p-value: 0.002545
perform_diagnostics(sqrt_cap_mod, "Square root Capital Expenditure Model")
## [1] "Durbin-Watson Test ( Square root Capital Expenditure Model ):"
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.0006, p-value = 0.00000000484
## alternative hypothesis: true autocorrelation is greater than 0
## 
## [1] "Breusch-Pagan Test ( Square root Capital Expenditure Model ):"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 4.9973, df = 1, p-value = 0.02539

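After the log and square-root transformations, the capital expenditure models remain statistically significant (p = 0.001808 and p = 0.002545) but still do not meet the regression assumptions. The Durbin-Watson tests continue to indicate positive autocorrelation (DW = 0.96763 and DW = 1.0006), and the Breusch-Pagan tests now indicate heteroscedasticity (p = 0.007988 and p = 0.02539). No recurrent expenditure model could be fitted because that variable is NA.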
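
The log model adds 1 to capital expenditure before taking logs because the skim output reports a minimum of 0. The sketch below shows the effect on a few values (the zero plus the three quartiles from the skim output, used here purely as toy inputs); `log1p()` is the base R shorthand for the same adjustment.

```r
# Toy inputs: a zero and the reported quartiles of Capital_Expenditure
capex <- c(0, 1485120, 2420991, 4347636)

plain_log    <- log(capex)      # first element is -Inf
adjusted_log <- log(capex + 1)  # the adjustment used for log_cap_mod
shorthand    <- log1p(capex)    # base R equivalent of log(1 + x)

print(plain_log)
print(adjusted_log)
print(shorthand)
```

After the adjustment a zero maps to log(1) = 0, far below the other observations on the log scale. That is consistent with the minimum residual of -13.3169 in the log model, and it suggests the zero observation may carry a lot of weight in that fit.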

Overall, the regression analysis shows a positive, linear and statistically significant but weak relationship between population and capital expenditure, with a Pearson’s product-moment correlation of 0.2128856.

1.4 What is the relationship between revenue growth and infrastructure delivery? (Model)

This analysis uses the total revenue growth rate as the measure of revenue growth and capital expenditure per capita as the measure of infrastructure delivery.

# Descriptive statistics
Cleaned_Accra_MMDAs_Data %>% skim(Capital_Exp_Per_Capita)
Data summary
Name Piped data
Number of rows 134
Number of columns 85
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Capital_Exp_Per_Capita 7 0.95 20.6 19.16 0 7.32 15.73 24.1 85.11 ▇▅▁▁▁
Cleaned_Accra_MMDAs_Data %>% skim(TtRev_Growth_Rate)
Data summary
Name Piped data
Number of rows 134
Number of columns 85
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TtRev_Growth_Rate 17 0.87 2.44 163.39 -1726.55 2.57 14.95 27.69 89.91 ▁▁▁▁▇
# Histograms
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Capital_Exp_Per_Capita)) +
  geom_histogram(bins = 10, fill = "dodgerblue", color = "black") +
  labs(title = "Distribution of Capital expenditure per capita", x = "Capital expenditure per capita") +
  scale_x_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = TtRev_Growth_Rate)) +
  geom_histogram(bins = 10, fill = "dodgerblue", color = "black") +
  labs(title = "Distribution of Total Revenue Growth Rate", x = "Total revenue growth rate") 

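Both histograms are strongly skewed. Capital expenditure per capita is right-skewed (median 15.73, maximum 85.11), while the total revenue growth rate is left-skewed by an extreme negative value (minimum -1726.55 against a median of 14.95).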
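
A quick way to judge how extreme the tails of the growth rate are is the conventional 1.5 x IQR rule. The sketch below applies it to the quartiles reported in the skim output; the rule is a common rule of thumb brought in here for illustration, not a step in the original analysis.

```r
# Summary values of TtRev_Growth_Rate copied from the skim output
p0   <- -1726.55
p25  <- 2.57
p75  <- 27.69
p100 <- 89.91

# Interquartile range and the usual 1.5 x IQR fences
iqr <- p75 - p25
lower_fence <- p25 - 1.5 * iqr
upper_fence <- p75 + 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))
print(p0 < lower_fence)     # is the minimum beyond the lower fence?
print(p100 > upper_fence)   # is the maximum beyond the upper fence?
```

Both extremes fall outside the fences, and the minimum does so by a very wide margin, so the regression in the next subsection may be sensitive to that single observation.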

1.4.1 Regression results

mod5 <- lm(Capital_Exp_Per_Capita ~ TtRev_Growth_Rate, data = Cleaned_Accra_MMDAs_Data)
summary(mod5)
## 
## Call:
## lm(formula = Capital_Exp_Per_Capita ~ TtRev_Growth_Rate, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.244 -12.992  -5.474   3.061  63.001 
## 
## Coefficients:
##                    Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)       22.104404   1.846949  11.968 <0.0000000000000002 ***
## TtRev_Growth_Rate -0.003893   0.011173  -0.348               0.728    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.63 on 111 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.001092,   Adjusted R-squared:  -0.007907 
## F-statistic: 0.1214 on 1 and 111 DF,  p-value: 0.7282
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = TtRev_Growth_Rate, y = Capital_Exp_Per_Capita)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)+
  labs(title = "Revenue Growth vs. Capital Expenditure (Per Capita)",
       x = "Total Revenue Growth Rate (%)",
       y = "Capital Expenditure Per Capita")

The regression results show no statistically significant relationship between total revenue growth rate and infrastructure delivery (capital expenditure per capita): the p-value (0.7282) is greater than the 0.05 significance level. Changes in revenue growth therefore do not significantly predict changes in capital expenditure per capita in this model. The R-squared (0.001092) indicates that only 0.11% of the variation in capital expenditure per capita is explained by the total revenue growth rate.

1.5 What is the relationship between expenditure growth and infrastructure delivery?

  • Regression results using expenditure growth (Expenditure_Growth) and infrastructure delivery (capital expenditure per capita).
Cleaned_Accra_MMDAs_Data$Expenditure_Growth <- c(NA, diff(Cleaned_Accra_MMDAs_Data$Total_Expenditure) / Cleaned_Accra_MMDAs_Data$Total_Expenditure[-nrow(Cleaned_Accra_MMDAs_Data)]) * 100




  
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Expenditure_Growth, y = Capital_Exp_Per_Capita)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Expenditure Growth vs. Capital Expenditure (Per Capita)",
       x = "Expenditure Growth Rate (%)",
       y = "Capital Expenditure Per Capita")

There is no statistically significant linear relationship between expenditure growth and capital expenditure per capita.
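
The growth rate in this subsection is computed with diff() over the whole Total_Expenditure column. If the rows stack several assemblies across years, that also produces a "growth" value from the last year of one assembly to the first year of the next. The sketch below contrasts the pooled calculation with a grouped one on toy data; the `District` column is hypothetical, since the report does not show the name of the identifier column.

```r
# Toy panel (hypothetical): two assemblies observed over three years
panel <- data.frame(
  District          = rep(c("A", "B"), each = 3),
  Year              = rep(2012:2014, times = 2),
  Total_Expenditure = c(100, 110, 121, 200, 180, 198)
)
panel <- panel[order(panel$District, panel$Year), ]

# Percentage growth relative to the previous row
growth_rate <- function(x) c(NA, diff(x) / head(x, -1)) * 100

# Pooled version: row 4 compares assembly B in 2012 with assembly A in 2014
panel$Growth_Pooled <- growth_rate(panel$Total_Expenditure)

# Grouped version: growth is computed within each assembly only
panel$Growth_By_District <- ave(panel$Total_Expenditure, panel$District,
                                FUN = growth_rate)

print(panel)
```

In the grouped column the first year of every assembly is NA, whereas the pooled column reports a spurious jump at the boundary between assemblies. Whether this applies to the MMDA data depends on how its rows are ordered.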

2 SHEET 2

2.1 What is the relationship between allocative and funding decision-making and revenue patterns?

# no variables

2.2 What is the relationship between allocative decision-making and expenditure patterns?

  • No direct variables are available for this question; descriptive statistics of closely related variables are shown below.
# Trends of Revenue and Expenditure over the years.


ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Total_Revenue)) + 
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  labs(title = "Total Revenue Trend",
       x = "Year (2012 - 2022)",
       y = "Amount (Ghana Cedis)") +
 scale_y_continuous(labels = comma) 

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Total_Revenue)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  labs(title = "Total Revenue Trend",
       x = "Year",
       y = "Amount (Ghana Cedis)") +
 scale_y_continuous(labels = comma) 

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Total_Expenditure)) + 
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  labs(
    title = "Trends in Total Expenditure Growth ",
    x = "Year (2012-2022)",
    y = "Amount (Ghana Cedis)"
  ) +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Total_Expenditure)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  labs(title = "Total Expenditure Trend",
       x = "Year",
       y = "Amount (Ghana Cedis)") +
 scale_y_continuous(labels = comma) 

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year)) +
  geom_point(aes(y = Total_Revenue, color = "Total Revenue")) +
  geom_point(aes(y = Total_Expenditure, color = "Total Expenditure")) +
  labs(title = "Revenue Vs. Expenditure Trends Over Years",
       x = "Year",
       y = "Amount (Ghana Cedis)", color = "Type") +
  scale_color_manual(values = c("Total Revenue" = "blue", "Total Expenditure" = "red")) +
  scale_y_continuous(labels = comma) 

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Total_Revenue, y = Total_Expenditure)) +
  geom_point(color = "blue") +
  labs( title = "Total Revenue  Vs. Total Expenditure (Ghana Cedis)",
        x = "Total Revenue", y = "Total Expenditure ") +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma) 

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year)) +
  geom_point(aes(y = IGF, color = "IGF")) +
  geom_point(aes(y = DACF, color = "DACF")) +
  geom_point(aes(y = Capital_Expenditure, color = "Capital Expenditure")) +
  geom_point(aes(y = Others_Sources, color = "Other Sources")) +
  labs(
    title = "Revenue Trends",
    x = "Year",
    y = "Amount (Ghana Cedis)",
    color = "Type"
  ) +
  scale_color_manual(
    values = c(
      "Total Revenue" = "#0000FF",  # Blue
      "Other Sources" = "#87CEEB",  # Light Blue
      "IGF" = "#00CD66",  # Green
      "DACF" = "#808080",  # Gray
      "Capital Expenditure" = "#9370DB",  # Purple
      "Total Expenditure" = "#FF0000",  # Red
      "Recurrent Expenditure" = "#FFD700"  # Yellow
    )
  ) +
  scale_y_continuous(labels = comma, breaks = seq(0, 60000000, 10000000)) + # Added breaks
  theme(
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

# IGF to Total Expenditure Ratio 
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = IGF_TE)) +
  geom_point(size = 2.5) +
  labs(
    title = "IGF to Total Expenditure Ratio Over Years",
    x = "Year",
    y = "Ratio (IGF/Total Expenditure)"
  ) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) 

cor.test(Cleaned_Accra_MMDAs_Data$Total_Expenditure, Cleaned_Accra_MMDAs_Data$Total_Revenue)
## 
##  Pearson's product-moment correlation
## 
## data:  Cleaned_Accra_MMDAs_Data$Total_Expenditure and Cleaned_Accra_MMDAs_Data$Total_Revenue
## t = 42.566, df = 127, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9531005 0.9763956
## sample estimates:
##       cor 
## 0.9666943

2.3 What is the relationship between population trend, service delivery and revenue and expenditure patterns?

# Revenue Per Capita
Cleaned_Accra_MMDAs_Data$Total_Revenue_Per_Capita <- Cleaned_Accra_MMDAs_Data$Total_Revenue / Cleaned_Accra_MMDAs_Data$Population
Cleaned_Accra_MMDAs_Data$IGF_Per_Capita <- Cleaned_Accra_MMDAs_Data$IGF / Cleaned_Accra_MMDAs_Data$Population
Cleaned_Accra_MMDAs_Data$DACF_Per_Capita <- Cleaned_Accra_MMDAs_Data$DACF / Cleaned_Accra_MMDAs_Data$Population




ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year)) +
  geom_point(aes(y = IGF, color = "IGF")) +
  geom_point(aes(y = DACF, color = "DACF")) +
  geom_point(aes(y = Others_Sources, color = "Other Sources")) +
  labs(
    title = "Revenue Trends",
    x = "Year",
    y = "Amount (Ghana Cedis)",
    color = "Type"
  ) +
  scale_color_manual(
    values = c(
      "Total Revenue" = "#0000FF",  # Blue
      "Other Sources" = "#87CEEB",  # Light Blue
      "IGF" = "#00CD66",  # Green
      "DACF" = "#808080",  # Gray
      "Capital Expenditure" = "#9370DB",  # Purple
      "Total Expenditure" = "#FF0000",  # Red
      "Recurrent Expenditure" = "#FFD700"  # Yellow
    )
  ) +
  scale_y_continuous(labels = comma, breaks = seq(0, 60000000, 10000000)) + # Added breaks
  theme(
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

# Total expenditure, population and IGF trends


ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Total_Expenditure)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  geom_point()+
  labs(title = "Total Expenditure Trend",
       x = "Year",
       y = "Amount (Ghana Cedis)") +
 scale_y_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = Population)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  geom_point()+
  labs(title = "Population Trend",
       x = "Year",
       y = "Population") 

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year, y = IGF)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  geom_point()+
  labs(title = "IGF Trend",
       x = "Year",
       y = "IGF") +
  scale_y_continuous(labels = comma) 

# Per capita plot
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Year)) +
  geom_line(aes(y = Total_Revenue_Per_Capita, color = "Total Revenue Per Capita")) +
  geom_point(aes(y = Total_Revenue_Per_Capita, color = "Total Revenue Per Capita")) +
  geom_line(aes(y = IGF_Per_Capita, color = "IGF Per Capita")) +
  geom_point(aes(y = IGF_Per_Capita, color = "IGF Per Capita")) +
  geom_line(aes(y = DACF_Per_Capita, color = "DACF Per Capita")) +
  geom_point(aes(y = DACF_Per_Capita, color = "DACF Per Capita")) +
  labs(title = "Revenue Per Capita trends", x = "Year", y = "Amount (Ghana Cedis)", color = "Type") +
  scale_y_continuous(labels = comma) 

cor_matrix <- cor(Cleaned_Accra_MMDAs_Data[, c("Population", "Total_Revenue", "Total_Expenditure", "IGF_TE", "IGF")], use = "complete.obs")
print(cor_matrix)
##                   Population Total_Revenue Total_Expenditure      IGF_TE
## Population         1.0000000    0.40021799       0.421275523 0.098553597
## Total_Revenue      0.4002180    1.00000000       0.970273216 0.083922559
## Total_Expenditure  0.4212755    0.97027322       1.000000000 0.001271894
## IGF_TE             0.0985536    0.08392256       0.001271894 1.000000000
## IGF                0.3859802    0.88365092       0.852844940 0.381545990
##                         IGF
## Population        0.3859802
## Total_Revenue     0.8836509
## Total_Expenditure 0.8528449
## IGF_TE            0.3815460
## IGF               1.0000000
corrplot(cor_matrix, main = "Correlation matrix of population and expenditure patterns")

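The correlation matrix shows a strong positive correlation between total revenue and total expenditure (r = 0.97). IGF is also strongly correlated with both total revenue (r = 0.88) and total expenditure (r = 0.85), whereas population is only moderately correlated with total revenue (r = 0.40) and total expenditure (r = 0.42).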
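
The revenue-expenditure correlation in this matrix (0.9703) differs slightly from the earlier cor.test() estimate (0.9667). A plausible reason is the handling of missing values: use = "complete.obs" keeps only rows that are complete on all five columns, while cor.test() drops rows missing in either of its two inputs only. The sketch below illustrates the difference on toy data with hypothetical columns.

```r
# Toy data (hypothetical) with missing values in different columns
toy <- data.frame(
  revenue     = c(10, 12, 15, 11, 18, 20, NA, 25),
  expenditure = c( 9, 13, 14, 10, 17, NA, 22, 24),
  population  = c(NA, 50, 55, 52, 60, 65, 70, 72)
)

# Listwise deletion: only rows complete on all three columns are used
cor_listwise <- cor(toy, use = "complete.obs")

# Pairwise deletion: each coefficient uses all rows complete on that pair
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# cor.test() on two columns drops rows missing in either of those two only
pair_test <- cor.test(toy$revenue, toy$expenditure)

print(cor_listwise["revenue", "expenditure"])
print(cor_pairwise["revenue", "expenditure"])
print(unname(pair_test$estimate))
```

The pairwise coefficient matches the cor.test() estimate, while the listwise one is computed on fewer rows and can differ. Either choice is defensible as long as the same rule is applied to all the figures being compared.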

2.3.1 Regression Analysis

# Total Revenue vs Population
model_revenue_pop <- lm(Total_Revenue ~ Population, data = Cleaned_Accra_MMDAs_Data)
summary(model_revenue_pop)
## 
## Call:
## lm(formula = Total_Revenue ~ Population, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -10217220  -3555800   -479229   2403423  16952672 
## 
## Coefficients:
##                Estimate  Std. Error t value      Pr(>|t|)    
## (Intercept) 6145864.309 1001564.430   6.136 0.00000000934 ***
## Population       25.536       5.278   4.838 0.00000361706 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5173000 on 131 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1516, Adjusted R-squared:  0.1451 
## F-statistic: 23.41 on 1 and 131 DF,  p-value: 0.000003617
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population, y = Total_Revenue)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Total Revenue vs Population", x = "Population", y = "Total Revenue") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma)

# Total Expenditure vs Population
model_expenditure_pop <- lm(Total_Expenditure ~ Population, data = Cleaned_Accra_MMDAs_Data)
summary(model_expenditure_pop)
## 
## Call:
## lm(formula = Total_Expenditure ~ Population, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -10237545  -3697052      1677   2911901  13515110 
## 
## Coefficients:
##                Estimate  Std. Error t value    Pr(>|t|)    
## (Intercept) 5557076.955 1057168.243   5.257 0.000000608 ***
## Population       26.284       5.489   4.789 0.000004627 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5268000 on 126 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.154,  Adjusted R-squared:  0.1473 
## F-statistic: 22.93 on 1 and 126 DF,  p-value: 0.000004627
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population, y = Total_Expenditure)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Total Expenditure vs Population", x = "Population", y = "Total Expenditure") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma)

# Capital Expenditure vs Total Revenue and IGF_TE
model_capital_rev_igf <- lm(Capital_Expenditure ~ Total_Revenue + IGF_TE, data = Cleaned_Accra_MMDAs_Data)
summary(model_capital_rev_igf)
## 
## Call:
## lm(formula = Capital_Expenditure ~ Total_Revenue + IGF_TE, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4101349  -954608   -37233   693716  9008776 
## 
## Coefficients:
##                     Estimate     Std. Error t value            Pr(>|t|)    
## (Intercept)     637359.33446   406877.83501   1.566               0.120    
## Total_Revenue        0.25936        0.02704   9.592 <0.0000000000000002 ***
## IGF_TE        -1063875.71009   651330.29426  -1.633               0.105    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1684000 on 119 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.4385, Adjusted R-squared:  0.429 
## F-statistic: 46.46 on 2 and 119 DF,  p-value: 0.000000000000001225
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Total_Revenue, y = Capital_Expenditure)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Capital Expenditure vs Total Revenue", x = "Total Revenue", y = "Capital Expenditure") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma)

ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Population, y = IGF_TE)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "IGF_TE vs Population", x = "Population", y = "IGF_TE") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = percent_format(accuracy = 1))

# IGF_TE vs Population and Total Revenue
model_igfte_pop_rev <- lm(IGF_TE ~ Population + Total_Revenue, data = Cleaned_Accra_MMDAs_Data)
summary(model_igfte_pop_rev)
## 
## Call:
## lm(formula = IGF_TE ~ Population + Total_Revenue, data = Cleaned_Accra_MMDAs_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34769 -0.13661 -0.02946  0.11525  1.18077 
## 
## Coefficients:
##                     Estimate     Std. Error t value      Pr(>|t|)    
## (Intercept)   0.341040867084 0.054460323270   6.262 0.00000000625 ***
## Population    0.000000224692 0.000000288791   0.778         0.438    
## Total_Revenue 0.000000002155 0.000000004046   0.533         0.595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2328 on 119 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.01207,    Adjusted R-squared:  -0.004535 
## F-statistic: 0.7269 on 2 and 119 DF,  p-value: 0.4856
ggplot(Cleaned_Accra_MMDAs_Data, aes(x = Total_Revenue, y = IGF_TE)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "IGF_TE vs Total Revenue", x = "Total Revenue", y = "IGF_TE") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = percent_format(accuracy = 1))

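The regression results above show significant linear relationships between total revenue and population (p < 0.001), between total expenditure and population (p < 0.001), and between capital expenditure and total revenue (p < 0.001). In contrast, neither population (p = 0.438) nor total revenue (p = 0.595) is a significant predictor of IGF_TE, and IGF_TE is not a significant predictor of capital expenditure (p = 0.105).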
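
To compare the four models side by side, the overall p-values can be recomputed from the reported F statistics and degrees of freedom. The sketch below does this with pf(); the figures are copied from the summaries in this subsection, and the data frame layout is an illustrative choice.

```r
# F statistics and degrees of freedom copied from the model summaries
models <- data.frame(
  model  = c("Total_Revenue ~ Population",
             "Total_Expenditure ~ Population",
             "Capital_Expenditure ~ Total_Revenue + IGF_TE",
             "IGF_TE ~ Population + Total_Revenue"),
  f_stat = c(23.41, 22.93, 46.46, 0.7269),
  df1    = c(1, 1, 2, 2),
  df2    = c(131, 126, 119, 119)
)

# Upper-tail probability of the F distribution gives the overall p-value
models$p_value <- pf(models$f_stat, models$df1, models$df2, lower.tail = FALSE)
models$significant_at_5pct <- models$p_value < 0.05

print(models)
```

Because the inputs are rounded, the recomputed p-values agree with the printed ones only approximately, but the pattern is the same: the first three models are significant at the 5% level and the IGF_TE model is not.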

2.4 What is the relationship between service delivery and revenue and expenditure patterns?

# no variables

2.5 SHEET 3